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Abstract 

We  discuss  gradient- based  learning  in  recurrent  neural  networks.  The  basic  equations 
are  derived  in  a  general  framework  of  continuous- time  dynamical  systems;  from  these, 
the  well-known  discrete-time  recurrent  backpropagation  algorithms  are  deduced.  The 
convergence  properties  of  such  algorithms  are  analyzed;  in  particular,  it  is  emphasized  that 
constraints  on  the  recurrent  weights  have  to  be  enforced  to  ensure  proper  convergence. 


Keywords:  Additive  model;  Contraction  mapping;  Elman  network;  Jordan  network; 
ODE  method,  Recurrent  backpropagation  algorithm;  Recurrent  network;  Recurrent  New- 
ton algorithm. 


1  Introduction 

Recurrent  networks  differ  from  feedforward  networks  in  that  internal  feedback  connections 
among  units  are  permitted.  On  the  one  hand,  recurrent  networks  embed  richer  dynamic 
structures  so  that  they  are  able  to  capture  more  temporal  characteristics  of  the  target 
sequence,  cf.  e.g.  Jordan  (1986).  On  the  other  hand,  recurrent  networks  summarize  and 
store  past  information  compactly  in  recurrent  units  so  that  they  are  capable  of  performing 
cognition  even  when  inputs  are  static  (Norrod,  O'Neill,  &  Gat,  1987).  Thus,  recurrent 
networks  are  particularly  useful  for  applications  in  which  temporal  structure  plays  an 
important  role.  For  such  and  many  other  reasons,  recurrent  networks  have  been  the 
subject  of  broad  research  interest  within  the  last  few  years.  The  fact  that  the  IEEE 
Transactions  on  Neural  Networks  have  devoted  one  of  their  1993  issues  exclusively  to  this 
topic  is  a  clear  indication  of  the  importance  of  these  systems. 

In  this  paper,  we  discuss  some  well  known  gradient-based  learning  algorithms  for  recur- 
rent networks  and  their  convergence  properties.  The  basic  equations  for  gradient  descent 
are  derived  in  a  general  framework  of  continuous-time  dynamical  systems  in  section  2. 
From  these,  the  discrete-time  recurrent  backpropagation  (BP)  algorithms  are  obtained  in 
section  3;  a  Newton-type  algorithm  is  introduced  as  well.  The  convergence  properties  of 
these  learning  rules  are  discussed  in  section  4.  In  particular,  it  is  emphasized  that  there 
are  constraints  to  the  feedback  connections  which  should  be  imposed  during  the  learning 
process  to  ensure  proper  convergence. 

2  Basics 

The  idea  of  adjusting  the  parameters  of  a  (general  nonlinear)  dynamical  system  by  gradient 
descent  on  some  performance  functional  is  neither  new  nor  specific  to  the  neural  network 
(NN)  field.  Such  methods  have  successfully  been  employed  in  systems  theory  and  control 
for  a  considerable  amount  of  time  (although  the  emphasis  in  these  areas  has  clearly  been 
on  linear  systems).  Nevertheless,  in  the  sequel  we  shall  provide  a  rather  self-contained 
discussion  of  the  basic  ideas  and  refer  to  the  neural  networks  literature  only. 

For  sake  of  notational  simplicity  we  shall  base  our  derivations  on  continuous-time 
systems;  discrete-time  models  are  discussed  in  section  3.  Hence,  consider  a  dynamical 
system  of  the  form 

u  =  f(t,u;w)  (1) 

with  initial  condition  u(t0)  =  u0.    Here,  u  is  the  ^-dimensional  vector  of  state  variables, 


1 


and  w  is  a  /z- dimensional  vector  of  parameters  ("weights")  which  is  to  be  adjusted  to 
achieve  certain  "suitable"  behavior  of  the  system.  The  exact  form  of  /  is  irrelevant  for  the 
derivation  of  the  basic  equations,  but  may  heavily  influence  their  actual  implementation. 
In  typical  NN  applications,  the  i-th  component  fi(t,u;  w)  of/  is  of  the  form  /{(ut-,neti,x,), 
where  net,  =  J2j  WijUj  and  x,  are  the  net  and  external  inputs  to  unit  i  at  time  t,  respec- 
tively, W{j  is  the  (i,j)-th  element  of  the  matrix  of  network  interconnection  strengths  W , 
and  the  parameter  vector  w  contains  the  elements  of  W.  E.g.,  we  could  use  w  =  vec(W), 
where  the  vec  operator  stacks  one  column  of  a  matrix  underneath  the  other.  In  particular, 
in  (one  version  of)  the  standard  additive  model, 

iii  =  -ut  +  <Xt(net,)  +  x,-,  (2) 

where  <r,  is  a  squashing  activation  function.  In  what  follows,  we  shall  use  lower  and  upper 
case  letters  for  vectors  and  matrices,  respectively.  Subscripts  denote  the  corresponding 
entries. 

The  performance  of  the  system  could  be  measured  in  several  different  ways.  In  some 
applications,  it  may  depend  on  the  whole  trajectory  (u(t),to  <t<tj)  for  some  "final"  tj\ 
in  others,  only  the  states  u(tk)  at  certain  discrete  times  tk  or  even  u{t)  in  the  limit  for  t  — ► 
oo  may  be  of  interest.  We  shall  refer  to  these  cases  as  "trajectory- based"  and  "state-based", 
respectively.  Notice  that  in  the  NN  literature  "trajectory  learning"  is  typically  contrasted 
to  "fixed  point  learning".  But  clearly,  the  latter  concept  only  adequately  characterizes 
the  situations  where  performance  is  measured  in  terms  of  a  static  equilibrium  reached  by 
the  system  for  t  -+  oo,  as  is  appropriate  when  e.g.  training  a  network  to  be  an  associative 
memory,  but  not  the  state-based  cases  where  such  an  equilibrium  is  not  reached  (e.g., 
if  the  inputs  come  from  some  stationary  random  process)  or  finite  time  horizons  are  of 
interest  (e.g.,  in  language  recognition).  Of  course,  combinations  between  state-based  and 
trajectory-based  criteria  are  also  possible;  e.g.  in  a  typical  control  application,  the  goal 
might  be  to  reach  a  desired  final  state  with  as  little  cost  as  possible. 

Let  £(t)  —  £(t,u(t))  be  the  performance  of  the  system  at  time  t.  In  typical  cases,  £(t) 
is  the  instanteneous  (prediction)  error 

m  =  l  £  («*(*)- w(*))a. 

xeo(t) 

where  0(t)  and  y(t)  are  the  set  of  output  units  and  the  target,  respectively,  employed  at 
time  t.  Note  that  u  and  hence  also  I  are  of  course  functions  of  the  adjustable  parameters 
w  as  well,  although  this  dependence  is  not  made  notationally  explicit.  Similarly,  we  shall 


drop  other  arguments  to  functions  when  no  confusion  arises  by  doing  so.  To  see  how  I 
varies  with  w,  we  can  use  the  chain  rule  to  obtain 

dl    _  ^  dl  dux 
dwj      ~  dui  dwj ' 

and,  using  (1), 

d  du{    _  s-^  dfi  duk       dfi 

dt  dwj      ~  duk  dwj      dwj 

In  what  follows,  it  will  be  convenient  to  write  dg/dv  for  the  Jacobian  [dgi/dvj],  where  i 
and  j  are  the  row  and  column  indices,  respectively,  and  Vvg  for  the  gradient  (dg/dv)' . 
Hence,  if  I  is  a  scalar  function,  dljdw  is  a  row  vector  and  Vw£  is  a  column  vector.  Using 
this  notation,  we  have 

£=",,  (3) 

aw      ou 

where  S  =  du/dw  satisfies  the  inhomogeneous  linear  ODE 

S  =  —S  +  —.  (4) 

ou        aw 

The  matrix  S  represents  how  the  state  u  varies  with  infinitesimal  changes  in  w.  In  control 
theory,  5  is  often  called  the  sensitivity  matrix,  and  equation  (4)  is  referred  to  as  the 
forward  sensitivity  equation. 

In  gradient  descent,  the  parameters  w  are  updated  proportional  to  the  negative  gradi- 
ent of  the  performance  functional  A  to  be  minimized,  i.e.,  Aw  =  -n  VWA,  where  n  is  the 
learning  rate.  Hence,  in  the  state-based  case  with  A  =  i(t/)  (updating  at  tj)  we  have 

Aw=  -nS(tfyVJ(tf). 

In  the  trajectory-based  cases,  performance  is  typically  measured  by  a  functional 

a=  [f  e(t)dt.  (5) 

Hence,  gradient  descent  modifies  w  according  to 

Aw  =  -n  f  '  Vwt(t)  dt=  -T)  I  '  S{t)'VJ(t)  dt.  (6) 

Jt0  Jto 

Notice  that  in  the  latter  case  we  could  also  accumulate  the  weight  changes  with  time  using 
w  =  -nS'VJ. 

Equation  (6)  is  just  a  (continuous-time  and  more  generally  written)  version  of  the 
recurrent  backpropagation  algorithm  as  proposed  e.g.  by  Robinson  h  Fallside  (1987), 


Williams  &:  Zipser  (1989a,  b).  A  variant  with  exponentially  weighted  instanteneous  errors 
is  dicussed  in  Gherrity  (1989).  If  equation  (5)  is  discretized  in  a  way  that  weight  updates 
are  performed  along  with  each  integration  step  of  equation  (4),  we  obtain  the  real-time 
recurrent  backpropagation  algorithm  of  Williams  &  Zipser,  see  also  Kuan,  Hornik,  & 
White  (1993). 

The  above  shows  that  the  exact  gradients  could  always  be  computed  forward  in  time 
based  on  a  direct  integration  of  the  forward  sensitivity  system.  Nevertheless,  they  might 
also  be  computed  or  approximated  differently;  in  the  sequel,  we  shall  discuss  some  exam- 
ples. 

As  in  neural  network  applications,  /i  typically  is  of  order  v2  >  v,  a  direct  integration  of 
the  forward  sensitivity  equations  is  rather  costly  (in  addition,  we  observe  that  we  basically 
solve  the  same  system  for  each  column  of  5).  Alternatively,  we  may  proceed  as  follows. 
Let  F(t)  be  a  fundamental  solution  of  the  associated  homogeneous  system 

F=^F,         F(t0)  =  I, 
ou 

and  write  K(t,r)  for  the  transport  matrix  F(i)F(r)-1.  Then  by  the  well-known  solution 

formula  for  linear  inhomogeneous  system  with  non-constant  coefficients,  we  obtain  that 

for  aR  to  <  s  <  t, 

S(t)  =  K(t,  s)S(s)  +  f  K(t,  r)^-(r)  ds.  (7) 

Js  OW 

In  particular,  if  the  initial  value  u(to)  does  not  depend  on  w,  S(to)  =  0,  hence  S(t)  = 
/£#(*,  r)  0//0tf(r)  dr  and 

ow  ou  Jto  ow 

As  G(t)  =  F(t)~l  solves  the  system  G  =  -Gdf/du,  we  can,  as  proposed  in  Sun,  Chen  & 
Lee  (1992),  set  up  two  auxiliary  systems 


G    =     -G  df/du,      G(t0)  =  /, 
H    =    Gdf/dw,         H(to)  =  0, 

and  use 

J£(0  =  a(t)H(t), 

ow 

where  a(t)  solves  the  linear  equation  a(t)G(t)  =  d£/du(t).  In  typical  NN  applications 
where  /x  =  0(v2)  and  df/dw  only  contains  0(u2)  nonzero  elements,  this  requires  only 
0(v3)  operations  per  time  step,  as  opposed  to  0(u4)  for  the  direct  integration. 


In  the  state-based  case  with  tf  =  oo,  it  may  not  be  necessary  to  follow  the  dynamics  of 
the  forward  sensitivity  equations.  In  fact,  assume  that  all  quantities  of  interest  converge 
for  t  — ►  oo  (the  case  of  "fixed  point  learning"),  we  find  that  5(oo)  solves  the  equation 

^(oo)5(oo)  + |^(oo)  =  0, 
au  aw 

i.e.,  5(oo)  =  —  (df/du(oo))~ldf/dw(oo),  and  hence 

dt.  .      at      fdf.   .yldft   , 

— (00)  =  — «-(oo)    —(00)         o-(oo). 
aw  au         \au        J      aw 

(Notice  however  that,  as  pointed  out  e.g.  in  Pearlmutter  ( 1990),  the  mapping  w  •-*>  u(oo;  w) 
is  not  necessarily  continuous  for  all  w  of  interest,  even  for  globally  convergent  systems.  For 
such  w,  the  argument  clearly  cannot  be  applied.)  Of  course,  the  efficient  way  of  computing 
the  above  expression  is  to  first  obtain  the  solution  u(oo)  of  the  ("small")  linear  system 
v df/du(oo)  =  d£/du(oc)  and  then  compute  —v(oo)df/dw(oo). 

To  use  this  relation,  one  could  first  run  the  system  (1)  up  to  some  tj  which  is 
large  enough  to  get  u(tf)  close  to  the  equilibrium  u(cc)  (if  this  is  asymptotically  sta- 
ble, convergence  is  exponentially  fast)  and  then  approximate  u(oc)  by  the  solution  of 
vdf/du(tj)  =  dt/du(tf).  When  applied  to  (discrete-time)  feedforward  architectures 
where  convergence  occurs  after  a  finite  number  of  time  steps,  this  reduces  to  ordinary 
backpropagation,  as  the  linear  system  can  then  be  solved  directly  by  backsubstitution  due 
to  the  block  triangularity  of  df/du. 

Alternatively,  u(oo)  could  be  computed  by  a  relaxation  method  as  the  equilibrium  of 
the  system 

v  =  v^  +  —.  8) 

au      au 

In  this  case,  the  approximate  gradients  are  computed  by  strict  forward  propagation.  When 

applied   to  the  additive  model  (2),  one  obtains  exactly  the  recurrent  backpropagation 

algorithm  discussed  in  Pineda  (1987).    Almeida  (1987)  gives  a  similar  algorithm  for  a 

slightly  different  model. 

In  the  trajectory-based  case,  we  can  utilize  the  explicit  solution  formula  (7)  of  the 

forward  sensitivity  equations  (again  assuming  that  S(to)  =  0)  to  obtain  by  a  simple 

change  of  the  order  of  integration  that 

£-/*'fw/v.>&*>4*«r«of*M*. 

dw      Jt0    au      \Jt0  aw  )  Jto  aw 

where 

*/  di 


ri  at 

v(t)  =   /      —(s)K(s,t)ds 
Jt      au 


is  the  solution  of 

df      di 

v=  ~v^ a~ 

ou      ou 

with  the  final  condition  v(tf)  =  0.    This  is  sometimes  referred  to  as  the  adjoint  (or 

backwards)  sensitivity  system. 

Hence  in  this  case,  we  can  also  compute  the  gradient  by  first  running  the  system  (1) 
forward  in  time  and  then  computing  v(t)  along  with  ftfv(s)df/dw(s)ds  backwards  in 
time.  When  applied  to  the  standard  additive  model  (2),  this  method  gives  the  algorithm 
suggested  by  Pearlmutter  (1989),  cf.  also  Toomarian  &  Barhen  (1991).  When  applied 
to  corresponding  discrete-time  systems,  one  obtains  the  "backpropagation  through  time" 
algorithm  of  Rumelhart,  Hinton  &  Williams  (1986),  and  v(t)  can  be  interpreted  as  the 
vector  of  back-propagated  signals.  For  more  details,  see  Baldi  (1993). 

One  potential  drawback  of  this  method  is  that  the  integration  backwards  in  time  makes 
it  necessary  to  store  the  whole  trajectory  u(t)  for  t0  <  t  <  tj  before  the  computation  of  v 
and  dt/dw  can  begin.  This  need  for  a  potentially  infinitely  large  stack  can  be  avoided  by 
solving  the  adjoint  sensitivity  equations  forward  in  time  along  with  an  auxiliary  system. 
Let  v(t)  =  ftodt/du(s)K(s,t)  ds  be  the  solution  of  the  adjoint  system  with  zero  initial 
condition.  Then  clearly,  v(t)  =  —v(tf)K(tf,t)  -f  v(t).  Hence, 

dA  -u  xev„   *   fh  *us-idfuM*       ftf  -t*?f 


a  A  rf  of  rli  of 

-  =  -i{tj)F{t))J  m-iJ-(t)dt+j  mJ.(t) 


Clearly,  both  integrals  on  the  right-hand  side  can  be  accumulated  forward  in  time  and 
combined  at  time  tj  with  the  first  one  premultiplied  by  —  v(tf)F(tf);  however,  it  is  not 
clear  whether  this  can  be  implemented  in  a  way  which  is  more  efficient  than  a  direct 
forward  propagation  based  on  the  forward  sensitivity  equations. 

Among  the  above  algorithms,  none  is  uniformly  better  than  all  others  for  all  possible 
applications.  We  have  already  seen  that  backpropagation  through  time  scales  poorly  with 
the  number  of  time  steps  (the  pattern  lengths)  before  updating.  On  the  other  hand,  its  sin- 
gle time  steps  are  typically  quicker  than  those  of  algorithms  based  on  forward  propagation. 
For  a  more  detailed  discussion  of  this  complexity  issue,  see  e.g.  Baldi  (1993). 

3      Examples  and  Extensions 

Thus  far,  gradient-based  learning  has  been  studied  explicitly  in  the  framework  of  contin- 
uous time  dynamical  systems  only.  Equivalently,  we  could  study  discrete-time  systems  of 
the  form 

ut+i  =  g(t,ut;w).  (9) 


There  is  a  one-to-one  correspondence  between  (1)  and  (9).  If  we  discretize  (1)  on  a  grid 
of  mesh  A  and  write  uk  =  u(to  -f  &A),  we  obtain 

Uk+i  «  uk  +  Af(t0  +  fcA,  uk;  w)  :=  g(k,  uk;  w). 

In  particular,  if  to  =  0  and  A  =  1,  we  have  the  correspondence  #(£,  u;w)  =  u  +  f(k,  u;  w). 
Notice  however  that  this  correspondence  is  only  formal;  the  two  systems  can  behave  quite 
differently  unless  the  unit  of  time  for  which  A  =  1  is  small  enough  to  make  difference 
ratios  good  approximations  to  derivatives. 

As  a  first  example,  consider  the  fully-connected  discrete  time  model 

ut+i  =  a(Aut  +  Bxt)  (10) 

considered  in  Williams  &  Zipser  (1989a,  b),  where  a  performs  componentwise  squashing, 
i.e.,  a(v)  =  [<Ti(vi)].  Writing  W  =  [A,  B],  z  =  [u',x']'  and  net  =  Au  +  Bx  =  Wz, 
we  can  more  compactly  write  this  system  as  ut+i  =  cr(nete).  Letting  w  =  vec(iy),  the 
corresponding  continuous-time  system  is  it  —  —u  +  cr(net)  =  f(t,u;w),  for  which 

-J-  =  -I  +  Da(net)A,         -J-  =  z'  %  Dcr(net), 
on  aw 

where  Dai  is  the  ordinary  derivative  of  <rt,  Da  is  the  matrix  with  diagonal  entries  Da{  and 

zero  off-diagonal  entries,  and  %  denotes  the  Kronecker  product.   (For  arbitrary  matrices 

P  and  Q,  their  Kronecker  product  P  ®  Q  is  obtained  from  P  by  replacing  each  entry 

Pij  of  P  with  the  matrix  pijQ.    Notice  that  if  z  is  a  column  vector,  then  P(z'  ®  /)  = 

(1  ®  P){z'  <S>  I)  =  z'  ®  P,  where  the  last  step  follows  from  the  product  rule  for  Kronecker 

products;  for  more  details,  cf.  e.g.  Magnus  &  Neudecker  (1988).) 

Assume  for  simplicity  that  all  units  are  visible  and  compared  to  targets  y  at  each  time 

step.  With  the  usual  quadratic  performance  measure  it(u)  —  (1/2)  !C«(y»*,<  —  **i)  ,  we  have 

dit/du  =  —e't,  where  et  =  yt  —  ut  is  the  network  "error"  at  time  t.  For  the  corresponding 

continuous-time  system,  we  obtain 

aw 
S    =     -S  +  Da(net)(AS  +  z' ®  I). 

Discretizing  this  forward  sensitivity  system  with  mesh  A  =  1,  we  thus  obtain  the  discrete- 
time  system 

St+i  =  Da{nett)(ASt  +  z't®  /), 


which  is  exactly  the  recurrent  backpropagation  algorithm  of  Williams  &  Zipser.  Compo- 
nentwise, the  above  reads  as 


sh,t+i  =  D<Ti(netitt)  r^2_  WijS3klt  +  8ikzlit\  , 


where  s\lt  is  dui/dwki  at  time  t,  and  8a-  the  Kronecker  delta.  Extensions  which  include 
teacher  forcing  can  be  obtained  similarly. 

The  above  system  can  be  used  to  implement  a  variety  of  different  weight  updat- 
ing schemes.  In  particular,  we  can  run  the  system  from  time  to  to  tf  and  use  Aw  = 
V^2tl+i  S'tet  (tne  trajectory-based  case)  or  Aw  =  S't  etf  (the  state-based  case).  A  special 
case  is  obtained  by  updating  the  weights  at  each  time  step  (i.e.,  tf  =  t0  +  I).  This  gives 
the  approximate  gradient  descent  scheme 

et    =    yt-  «t, 
wt+l     =     wt  +  T]tS'tet, 
St+i     =     Dcr{^it)(AtSt  +  z't®I), 

where  here  and  in  what  follows  a  variable  is  written  with  a  "hat"  symbol  if  it  is  evaluated  at 
the  parameter  estimates  w.  This  is  just  the  real-time  recurrent  backpropagation  algorithm 
of  Williams  &  Zipser. 

As  a  second  example,  let  us  apply  the  idea  of  fixed  point  learning  to  the  discrete  time 
system  (10).  The  relaxation  equation  (8)  for  the  corresponding  continuous-time  system  is 

v  =  —  v  +  v  Da(net)A  —  e', 
which  discretizes  into 

vt+i  =  vtDa(nett)A  -  e't. 

This  gives  the  approximation 

— ^  «  uf— -  =  vt(z[  (g>  Da(nett))  =  z\  ®  vtD(r(nett), 
aw  aw 

which  componentwise  reads  as  dioo/dw^i  ss  Vk^Dcr^netk^zij-    This  is  a  hebbian  rule 

with  the  presynaptic  activity  term  z  and  the  postsynaptic  term  —  v  Da(net);  of  course, 

some  special  machinery  is  required  to  compute  the  latter.     This  interpretation  might 

suggest  a  modification  which  is  based  on  a  relaxation  method  for  the  equilibrium  value 

Poo   =   — Uqo  -Do^netoo)  of  the  postsynaptic  term.     Observing  that  p^  solves  the  linear 

equation  p  =  (pA  +  e00)Ccr(net00),  we  can  approximate  it  through 

pt+i  =  (ptA  +  e't)Da(nett) 


and  use 

— —  «  -z't®pt. 
aw 

An  on-line  version  based  on  these  fixed-point  approximations  with  weight  updates  at  each 

time  step  is  thus 

e<     =     Vt-  ut, 

Wt+l       =       Wt  +  T}tZtpt, 

Pt+i     =     (PtAt  +  et)D<r(neit), 

which  is  the  recurrent  backpropagation  algorithm  of  Pineda  (1987). 

A  more  general  system  than  (10)  is  studied  in  Kuan,  Hornik  &  White  (1993).  Their 
starting  points  are  networks  with  a  single  hidden  layer  and  delayed  internal  feedbacks. 
Such  networks  can  be  described  by  the  equations 

Wt+i     =     g{ut,xt;w) 

ot     =     h(Ca(Aut  +  Bxt))), 

where  w  =  [vec(A)',  vec(B)',  vec(C)']'  is  the  vector  of  network  connection  strengths.  Clearly, 
a(Aut  +  Bxt)  is  the  vector  of  hidden  unit  activations  at  time  t.  When  g  =  0  such  that 
ut  =  0,  the  above  reduces  to  a  single  hidden  layer  feedforward  network.  If  ut  =  ot-i, 
we  obtain  the  Jordan  network  with  output  feedbacks,  cf.  Jordan  (1986,  1992);  if  ut  = 
a(Aut^i  +  Bxt-\)  is  the  lagged  vector  of  hidden  unit  activations,  we  obtain  the  Elman 
network  with  hidden-unit  activations  feedbacks,  see  Elman  (1990).  If  h  —  id,  C  —  I  and 
ut  =  Oj_!  (such  that  the  system  and  output  equations  are  identical),  we  obtain  the  model 
(10)  considered  earlier,  which  could  thus  also  be  termed  a  Jordan  network  with  hidden 
and  output  layer  collapsed. 

The  above  class  of  discrete  time  neural  networks  is  particularly  attractive  in  appli- 
cations in  dynamic  environments.  For  example,  Jordan  (1986,  1992)  is  concerned  with 
the  control  and  learning  of  robot  movements;  hence  the  Jordan  network  was  originally 
constructed  so  that  it  can  "memorize"  previous  positions  of  the  robot.  More  generally, 
the  Jordan  network  is  particularly  suitable  when  the  serial  order  of  outputs  is  important. 
On  the  other  hand,  Elman  (1990)  is  interested  in  learning  sequential  pattern  in  linguis- 
tics, such  as  the  syntactic  or  semantic  features  of  words;  the  Elman  network  was  then 
constructed  to  "memorize"  internal  states,  i.e.,  hidden  unit  activations.  Thus,  hidden- 
unit  activation  feedbacks  in  Elman  networks  in  fact  encode  the  temporal  properties  of 
sequential  inputs;  this  is  extremely  useful  in  many  dynamic  applications. 


The  exact  functional  form  of  the  output  equations  is  not  important  for  the  derivation 
of  the  gradient  computations.  Hence,  more  generally,  consider  a  system  ut+i  =  g(ut,xt;  w) 
along  with  a  general  output  equation 

ot  =  h(ut,xt;w). 

The  basic  updating  equations  could  now  easily  be  obtained  by  applying  the  general  princi- 
ple to  an  augmented  state  vector  ut  =  [o't_1,u't]'  (note  that  the  time  indexing  differs  from 
the  previously  considered  models).  Alternatively,  and  more  conveniently,  we  can  proceed 
as  follows.  If  the  performance  measure  lt  =  lt{ot)  is  a  function  of  the  outputs  which  in 
turn  are  computed  from  the  states  of  the  system  and  the  system's  parameters,  we  have 

di_       (Hdo_       dtfdhfa       dh\ 
dw       dodw       do  \dudw       dwj 

For  the  typical  performance  measure  lt  =  (1/2)  J3ie?t>  where  et  =  yt  —  ot  with  yt  the 
target  pattern  employed  at  time  t,  we  have  di/do  —  —e'  and  hence 

dt        ,1#        lM     do     dhc     dh 

—  =  -e'M,         M  =  —  =  —  S  +  -T-. 
ow  aw       ou         ow 

The  updating  rule  for  the  sensitivities  S  =  du/dw  can  e.g.  be  obtained  discretizing  the 

corresponding  continuous-time  forward  sensitivity  equations  which  gives 

ou  aw 

As  usual,  the  above  can  be  combined  with  a  variety  of  weight  updating  schedules.  If  the 
weights  are  updated  at  each  time  step,  we  obtain 

et     =     Vt-  h(ut,xt;wt) 

dh ,  ~  -  x n       dh  „ 

Mt     =     ^r-{ut,xt;wt)St  +  —{ut,xt;wt) 
Ou  ow 

wt+i     =     wt  +  nt  M[et 
ut+i     =    g{ut,xt\wt) 

c  dy<-      mc  ,  ^/-      -  \ 

St+\     =    -z-{uuxt;wt)St  +  —{uuxt;wt), 
ou  ow 

which  is  the  on-line  recurrent  backpropagation  algorithm  considered  in  Kuan,  Hornik,  & 
White  (1993).  Clearly,  if  u  is  not  present  in  the  network,  this  algorithm  simply  reduces 
to  the  standard  BP  algorithm  for  feedforward  networks.  In  the  network  (10),  ut  —  ot-\. 
Hence,  Mt  =  Su  so  that  the  equation  for  Mt  is  redundant.  The  on-line  recurrent  BP  algo- 
rithm then  becomes  (a  more  general  variant  of)  the  real-time  recurrent  backpropagation 
algorithm  of  Williams  k,  Zipser. 
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All  algorithms  discussed  thus  far  are  based  on  (approximate)  gradient  descent  and 
may  thus  exhibit  very  poor  convergence  behavior.  For  feedforward  networks,  Kuan  & 
White  (1993a)  have  shown  that  a  stochastic  Newton  algorithm,  which  takes  second-order 
derivatives  into  account  as  well,  is  computationally  and  statistically  more  efficient  than 
the  (gradient  descent)  BP  algorithm  and  in  fact  asymptotically  equivalent  to  the  nonlinear 
least  squares  estimator  under  very  general  conditions.  Their  Newton  algorithm  is  clearly 
motivated  by  the  Newton  method  in  numerical  optimization  and  is  analogous  to  that  in 
the  system  identification  literature;  see  e.g.  Ljung  &  Soderstrom  (1983).  Analogously,  the 
following  Newton-type  algorithm  is  a  natural  extension  of  the  (real-time)  recurrent  BP 
algorithm  (Kuan,  1993): 

et     =     Vt-  h(ut,xt;wt),  (11) 

Mt     =     —  (ut,xt;wt)St  +  ^-(ut,xt;wt)  (12) 

ou  aw 

Ht+1     =     Ht  +  Vt(M{Mt-Ht),  (13) 

wt+i     =     wt  +  r)tH-+xM'tet  (14) 

iit+i     =     g(ut,xt;wt)  (15) 

St+i     =     -^(ut,xt;wt)St  +  -z-(ut,xt;wt),  (16) 

ou  aw 

Again,  if  u  is  not  present  in  the  network,  the  recurrent  Newton  algorithm  simply  reduces 
to  the  Newton  algorithm  for  feedforward  networks.  In  contrast  to  the  recurrent  BP  al- 
gorithm, the  recurrent  Newton  algorithm  contains  an  additional  updating  equation  (13) 
which  recursively  updates  the  outer  product  of  M[  to  approximate  the  Hessian  matrix  and 
provide  an  approximate  Newton  direction  for  (14).  This  approximation  is  quite  common 
in  engineering,  cf.  e.g.  Ljung  &  Soderstrom  ( 1983);  in  particuar,  it  makes  it  easier  to  ensure 
that  the  estimates  of  the  Hessians  remain  positive  definite  and  hence  invertible  thoughout 
the  algorithm. 

There  are  some  difficulties  associated  with  this  algorithm.  First,  it  is  not  "local" 
because  (14)  involves  matrix  inversion.  Second,  it  may  not  follow  a  correct  search  direction 
if  Ht  is  not  a  positive  definite  matrix.  Let  t/t  —  (1  —  ■qt)rft_\ / rjt  and  Pt  =  Tjt-iHt-  Then 
by  a  matrix  inversion  formula, 

A+i  =  -(Pt-  PtMl(MtPtM't  +  utI)-lMtPt)  . 

For  the  network  with  a  single  output,  Mt  is  a  row  vector,  and  hence 
p      =  I  U  _    PtMjMtPt  \ 
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which  does  not  involve  matrix  inversion.  For  networks  with  multiple  outputs,  we  can 
follow  Bierman  (1977)  and  compute  Pt+i  using  a  sequence  of  single-output  algorithms 
in  which  no  matrix  inversion  is  needed.  Kuan  (1993)  thus  proposes  a  modified  Newton 
algorithm  which  contains  (11),  (12),  (15),  and  (16)  in  the  original  version  but  substitutes 
the  updating  equations  below  for  (13)  and  (14): 

^(°i     =     A,  (17) 

p(i-i)^,/  ™.   p(i-i) 
M+i     -    rt+i  .      -r,-n  . ,  '        J  — A» •••»*»  U°J 


™i,*^!+i    ™5,i  +  ^ 


A+i     =    P&M,  (19) 

A+1    =    {*«'  ^W>o,  (20) 

(   P(+i+At+i,     otherwise, 
u;t+1     =     u>t  +  Pt+1Mt'ef,  (21) 

where  t  is  the  number  of  output  units,  mht  is  the  _/-th  row  of  M(,  e  is  a  small  positive 
constant,  and  At+\  is  chosen  to  make  Pt+i  —  (I  >  0,  i.e.,  a  positive  semidefinite  matrix. 
Note  that  in  (17)— ( 19),  Pt+\  is  updated  as  each  output  unit  is  added  sequentially,  and 
that  (20)  implements  a  correction  ensuring  Pt  to  be  a  p.s.d.  matrix. 

In  applications,  one  usually  cannot  guarantee  a  priori  that  the  parameter  estimates  do 
not  diverge  to  infinity.  This  is  typically  enforced  by  implementing  some  projection  device 
7T.  I.e.,  if  9  is  the  vector  of  parameters  updated  by  the  algorithm  under  consideration 
(such  that  $  =  w  in  the  recurrent  backpropagation  and  0  —  [w',vec(P)']'  in  the  Newton 
algorithm)  and  0  is  a  compact  subset  of  the  parameter  space,  then  ir(d)  =  9  for  9  €  0  and 
n(9)  6  0  for  9  £  0.  The  actual  algorithm  is  then  obtained  by  applying  ir  to  the  estimates 
9t  after  each  updating  step.  In  practice,  one  would  of  course  like  to  choose  truncation 
bounds  large  enough  so  that  the  behavior  of  the  algorithm  is  virtually  unaffcted  by  the 
projection  device.  However,  this  is  not  always  possible,  as  suggested  by  the  following 
convergence  analysis. 

4      Convergence  Analysis 

A  rigorous  and  satisfactory  analysis  of  the  behavior  of  the  learning  algorihms  presented 
thus  far  under  the  presentation  of  a  sequence  of  learning  patterns  is  clearly  extremely 
hard,  if  not  impossible.  Suppose  that  the  patterns  are  drawn  at  random.  Then  a  learning 
algorithm  generates  a  sequence  of  estimates  9t  according  to  some  stochastic  discrete-time 
system.  In  typical  cases,  it  is  possible  to  relate  the  analysis  of  this  system  to  the  asymptotic 
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behavior  of  an  associated  deterministic  continuous-time  system,  the  so-called  associated 
ODE,  which  is  obtained  by  "suitably  averaging  over  all  patterns".  Such  a  relation  can  be 
established  for  small  constant  learning  rates  77  in  the  sense  of  weak  convergence  of  random 
processes  as  77  — ►  00  (Kushner,  1984)  as  well  as  for  learning  rates  r)t  which  decay  to  zero 
at  a  suitable  rate  (Kushner  &  Clark,  1978).  The  latter  methodology  has  been  used  in 
particular  to  analyze  the  standard  BP  and  Newton  algorithms  for  feedforward  networks, 
see  e.g.  Kuan  h  White  (1993a).  In  the  following,  a  general  theorem  due  to  Kuan  h 
White  (1993b)  is  given  and  applied  to  the  recurrent  BP  and  Newton  algorithms. 
Consider  a  general  learning  algorithm  of  the  form 

Ot+i     =     *{6t  +  mQ(rt,ztJt)), 

rt+i     =    p(rt,ZtJt),  (22) 

where  ir  is  some  suitable  projection  device  on  the  parameter  space  and  z  is  an  exogeneous 
variable  (i.e.,  the  sequence  of  training  patterns  zt  does  not  depend  on  6).  Formally,  all 
algorithms  considered  in  this  paper  are  of  the  above  form,  as  the  patterns  z  could  be 
continuous-time  functions  of  length  L(z).  In  what  follows,  we  shall  however  assume  that  z 
is  a  finite-dimensional  vector;  this  still  covers  all  cases  where  the  patterns  are  finite-length 
temporal  sequences. 

For  a  continuous  vector  field  u(-)  let  the  vector  field  7f[i/(-)]  be 

*[,(»] -lim*^"^-',        »€6; 

this  technical  structure  is  used  to  characterize  the  limiting  ODE  when  the  projection 
device  %  is  imposed.  Let 

§0{r)  =  (^r1) §t  +  (Llr) §t+u    T  e  [r"r'+i)' 

where  To  =  0  and  for  t  >  1,  rt  =  $3|=o  %,  and  let  the  left  shifts  9l  of  6°  be  given  by 

e\T)  =  e°(rt  +  t). 

In  other  words,  9*  is  the  process  obtained  by  linearly  interpolating  the  sequence  0t,  #t+i,  • . ., 
0t+fc,...  at  the  time  knots  0,7?t+i, . .  .,r;t+1  H +  r]t+k,.... 

Consider  the  following  conditions. 
[Al]   {zt}  is  a  sequence  of  bounded,  i.i.d.  random  variables. 
[A2]  The  function  Q  is  continuously  differentiate  of  order  one  in  all  arguments. 


13 


[A3]  The  function  p  is  continuously  differentiable  of  order  two  in  all  arguments.  Also, 
for  each  (z,#),  p  is  a  contraction  mapping  in  the  first  argument  r,  i.e.,  \p(r\,z,9)  - 
p(r2,z,9)\  <  c\ri  -  r2\  with  c  <  1. 

[A4]  {r/t}  is  such  that  J2t  Vt  —  °o  and  Yi,t  Vt  <  °°° 

[A5]  For  each  9  €  0,  Q(#)  :=  lim^oo  TE(Q(rt(9),zt,9))  exists,  where  0  is  the  image  of  k 
and  rt(9)  is  generated  by  the  recursion  r<+i  =  p(rt,zt,9). 

These  conditions  are  stated  for  convenience;  formal  statements  can  be  found  in  Kuan, 
Hornik  &  White  (1993)  or  Kuan  &  White  (1993b).  The  i.i.d.  assumption  in  [Al]  is  in  fact 
much  stronger  than  needed;  in  particular,  the  result  below  continues  to  hold  when  {zt}  is 
a  stationary  and  ergodic  sequence  or  a  weakly  dependent  sequence.  The  conditions  [A2] 
and  [A3]  are  "smoothness"  conditions;  some  of  the  continuous  differentiability  conditions 
can  be  replaced  by  the  Lipschitz  continuity  conditions.  The  condition  [A4]  specifies  the 
order  of  learning  rate;  for  example  r)t  of  order  l/t  satisfies  this  condition.  Finally,  [A5]  is 
needed  to  define  the  associated  ODE.  Note  that  the  other  conditions  also  imply  that  Q  is 
continuous. 

The  following  result  is  established  in  Kuan  &  White  (1993b). 

THEOREM.  Suppose  that  [A1]-[A5]  hold. 

(a)  Let  9(-)  be  the  limit  of  a  convergent  subsequence  of  {#*(•)}■  Then  with  probability 
one  (w.p.l),  §(■)  satisfies  the  ODE  9  =  it[Q(9)]. 

(b)  Let  0*  be  the  set  of  locally  asymptotically  stable  equilibria  in  0  for  the  ODE  in  (a) 
with  domain  of  attraction  V{Q").  If  Q  C  P(0*),  then  9t  —>  0*  w.p.l. 

(c)  If  0  %  V{0)*),  but  9t  enters  a  compact  subset  of  V(Q*)  infinitely  often  w.p.l,  then 
9t  — ►  0*  w.p.l.  If  further,  0*  contains  only  finitely  many  points,  then  9t  —*■  9* 
w.p.  1  for  some  9*  €  0* .  This  convergence  is  conditional  on  the  initial  value  of  the 
algorithm  and  the  realization  of  9t. 

Part  (a)  of  this  theorem  says  that  9t  essentially  follows  the  solution  path  of  the  ODE 
9  =  Q(9)  in  0.  Note  that  the  projection  device  is  not  effective  unless  9  is  on  the  boundary 
of  0  and  the  vector  field  Q(9)  points  outside  of  0.  Then  by  the  local  asymptotic  stability 
of  the  ODE,  §(•),  hence  0t,  converges  to  the  set  of  locally  asymptotically  stable  equilibria 
(part  (b)  and  (c)).  If  there  are  only  finitely  many  such  equilibria,  the  last  part  of  the 
theorem  asserts  that  9t  converges  to  one  of  them;  cycling  between  equilibria  is  not  possible. 
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By  letting  zt  =  [x't,y't]'  and  rt  =  [u't,  vec(5't)/]/,  we  can  apply  the  above  theorem  to 
the  recurrent  BP  algorithms  of  Williams  &  Zipser  (1989)  and  Kuan,  Hornik  &  White 
and  to  the  Newton  algorithm  of  Kuan  (1993).  Hence,  the  convergence  behavior  of  these 
algorithms  can  be  obtained  as  corollaries.  In  these  applications,  it  is  readily  seen  that 
Q(8)  =  0  implies  the  first  order  conditions  for  the  minimization  of  the  asymptotic  mean 
square  error  E{9)  —  lim^oo  IE]Tt(y,)(  —  o,)t(0))2.  Hence,  a  locally  asymptotically  stable 
equilibrium  of  the  associated  ODE  is  just  a  local  minimum  of  E.  From  the  above  theorem, 
we  can  thus  infer  that  these  algorithms  converge  to  a  local  asymptotic  MSE  minimizer 
with  probability  one. 

The  "smoothness"  conditions  imposed  here  are  not  very  restrictive;  most  of  the  net- 
work activation  functions  adopted  in  applications,  such  as  the  logistic  and  hyperbolic 
tangent  functions,  satisfy  these  conditions.  On  the  other  hand,  the  contraction  mapping 
condition  in  [A3]  is  crucial  and  is  not  satisfied  automatically.  In  fact,  it  is  easily  seen 
that  the  underlying  discrete-time  system  rf+1  =  p(rt,zt,w)  could  "explode"  if/?  is  not  a 
■contraction  mapping  in  r,  because  the  effects  of  zt  could  accumulate  very  rapidly.  Even  if 
p  is  a  bounded  (or  squashing)  function,  this  condition  is  still  needed;  otherwise,  rt  could 
approach  the  upper  or  lower  bound  of  p  within  a  very  short  period  of  time,  which  would 
"exaggerate"  the  behavior  of  the  internal  states  and  invalidate  the  learning  results.  If  p 
is  a  contraction  mapping,  then  the  system  implements  an  exponentially  forgetting  mem- 
ory of  the  data  sequence  and  is  essentially  well-behaved.  In  fact,  we  notice  that  for  the 
recurrent  backpropagation  and  Newton  algorithms,  the  contraction  mapping  property  is 
(essentially  equivalent  to)  requiring  that  all  the  eigenvalues  of  dg/du  be  less  than  one  in 
absolute  value  (uniformly  in  x  and  w),  which  not  only  ensures  proper  behavior  of  the 
sequence  of  system  states,  but  also  is  necessary  for  the  stability  of  the  forward  sensitivity 
system,  as  has  already  been  pointed  out  by  Almeida  (1987). 

As  an  example,  let  us  again  consider  the  Jordan  and  Elman  networks  where 

ot  =  h(Ca(Aut+  Bxtj) 

and  ut  is  the  vector  of  lagged  output  or  hidden  layer  activations,  respectively.  Let  Mh 
and  Ma  be  the  sup  of  the  derivatives  of  the  (usually  identical)  output  and  hidden  layer 
activation  functions,  respectively.  It  is  readily  seen  that  the  contraction  mapping  property 
for  the  Jordan  network  is  satisfied  when 

Ej  lc«ill°i*l  <  {MhMa)~l 
for  all  i  and  k.  Similarly,  we  find  that  for  the  Elman  network,  it  suffices  that 

U\\f<  1/Af„ 
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where  ||.4||f  is  the  Frobenius  norm  of  A  (\\A\\2F  is  the  trace  of  A' A,  i.e.  the  sum  of  the 
squares  of  the  entries  of  A). 

To  ensure  proper  convergence  behavior,  the  connection  weights  must  be  suitably  con- 
strained during  the  learning  process.  Note  that  in  the  Jordan  network  this  is  a  restriction 
on  the  hidden-to-output  weights  and  the  internal  feedback  weights  simultaneously.  In  the 
Elman  network,  however,  only  recurrent  connections  are  subject  to  the  constraint  so  that 
the  feedforward  part  of  the  network  is  not  affected.  Thus,  as  far  as  the  representation  ca- 
pability of  a  network  is  concerned,  the  Elman  network  is  clearly  superior  since  less  network 
connection  weights  are  subject  to  constraints.  It  is  straightforward  to  verify  that  some 
sufficient  conditions  ensuring  the  contraction  mapping  property  in  the  Elman  network  are: 

•  \aij\  <  4/v  for  all  i,  j  if  each  cr,  is  the  logistic  function, 

•  |a,-j|  <  l/v  for  all  i,j  if  each  cr,  is  the  hyperbolic  tangent  function, 

where  v  is  the  number  of  hidden  units. 

These  restrictions  are  practically  important,  as  demonstrated  in  the  simulation  results 
of  Kuan,  Hornik  &  White  (1993)  and  Kuan  (1993);  some  of  their  results  are  summarized 
in  Table  1.  These  simulations  find  that 

MSE  of  the  Newton  algorithm  with  the  constraint 

<  MSE  of  the  Newton  algorithm  without  the  constraint 

<  MSE  of  the  BP  algorithm  with  the  constraint 

<  MSE  of  the  BP  algorithm  without  the  constraint. 

In  fact,  the  MSEs  of  the  BP  algorithm  without  the  constraint  are  much  larger  than  those 
of  other  algorithms;  the  Newton  algorithm  without  the  constraint  behaves  unstably  and 
may  produce  very  large  errors  during  the  learning  period.  On  the  other  hand,  both 
the  BP  and  Newton  algorithms  with  the  constraint  are  well  behaved,  but  the  Newton 
algorithm  results  in  much  lower  MSE  and  converges  much  faster  than  the  BP  algorithm. 
An  interesting  finding  is  that  adding  more  hidden  units  need  not  result  in  lower  MSE  if 
a  learning  algorithm  is  used  without  the  constraint,  whereas  the  MSEs  of  the  algorithm 
with  the  constraint  typically  decrease  when  the  number  of  hidden  units  increases.  This 
suggests  that  the  "learned"  network  resulted  from  an  algorithm  without  the  constraint 
could  be  very  misleading. 
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Table  1.  Summary  of  Simulation  Results. 


Model 

V 

Newton  with  Const. 

Newton  w/o  Const. 

BP  with  Const. 

BP  w/o  Const. 

Average 
MSE 

Last 
MSE 

Average 
MSE 

Last 
MSE 

Avg. 
MSE 

Last 
MSE 

Avg. 
MSE 

Last 
MSE 

NLMA 

4 

5 
6 

N/A 
N/A 
N/A 

N/A 
N/A 
N/A 

N/A 
N/A 

N/A 

N/A 
N/A 

N/A 

1.280 
1.273 
1.268 

1.253 
1.241 
1.229 

1.332 
1.338 
1.339 

1.307 
1.306 
1.304 

BL1 

4 
5 
6 

N/A 
N/A 
N/A 

N/A 
N/A 
N/A 

N/A 
N/A 
N/A 

N/A 
N/A 
N/A 

1.399 
1.392 

1.375 

1.366 
1.351 
1.325 

1.456 
1.449 
1.462 

1.429 
1.416 
1.423 

BL2 

4 

5 
6 

1.825 
1.820 
1.802 

1.824 
1.811 
1.789 

1.900 
1.902 
1.924 

1.839 
1.852 
1.867 

2.181 
2.139 
2.109 

2.154 
2.111 
2.078 

2.547 
2.579 
2.634 

2.533 
2.564 
2.615 

HM 

4 
5 
6 

0.125 
0.101 
0.079 

0.125 
0.098 
0.079 

0.176 
0.159 
0.152 

0.162 
0.142 
0.158 

0.329 
0.324 
0.302 

0.324 
0.319 
0.296 

0.491 
0.527 
0.536 

0.488 
0.524 
0.530 

The  networks  are  Elman  networks  with  2  input  and  v  hidden  units  and  a  single  output 
unit;  inputs  and  targets  at  time  t  are  [yt-2,yt-\]'  and  yt,  respectively,  with  the  data 
generating  processes  for  {yt}  being: 

NL  MA  (nonlinear  MA)  yt  =  -0.3€t-i  +  0.2t,_2  +  0.4et_i£(_2  -  0.25€?_j  +  tt, 

BL  1  (bilinear  1)  yt  =  0.5  -  0.4t/,_!  +  QAyt-iet-i  +  ft, 

BL  2  (bilinear  2)  yt  -  0Ayt-\  -  0.3yt_2  +  0.5yt-i(t-i  +  *t, 

HM  (Henon  map)  yt  =  1  +  0.3yt_2  -  1.4y£_1?         y0  =  -u,y_i  =  0.5m, 

where  the  €t  are  independent  Ar(0, 1)  and  u  is  uniform  on  [0, 1]. 
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